Making Sense of NBA Wingspan Data by Matt Ignal

Introduction

Prospect projections in basketball have always interested me, but given how often length and wingspan are referenced in scouting reports, I was surprised to find how little public work makes sense of wingspans and how they should factor into projections. It makes good basketball sense that a longer wingspan should improve a player’s defensive ability and potential, since long arms are useful for clogging up passing lanes and protecting the rim, but can wingspan data improve upon box-score metrics (steals, blocks, fouls) to estimate a player’s defensive contributions?

The dataset presented here is taken from a series of datasets I scraped and compiled from RealGM, Basketball Reference, and DraftExpress. The data contains only players who entered the NBA in 2003 or later. To simplify things for the purpose of this project, I removed 50+ columns from the data, leaving mainly defensive-oriented variables.

Box Plus/Minus is a box-score metric whose basis is a 14-year “RAPM” sample, a plus/minus-oriented (i.e. how does the team perform when a player is on the court?) metric. The basic idea is that a player’s box score could predict that player’s RAPM. This investigation will focus specifically on Defensive Box Plus/Minus, which is calculated simply by subtracting Offensive Box Plus/Minus (OBPM, also captured by box score) from overall Box Plus/Minus. A more detailed description can be found on Basketball Reference. As they caution:

“Box Plus/Minus is good at measuring offense and solid overall, but the defensive numbers in particular should not be considered definitive. Look at the defensive values as a guide, but don’t hesitate to discount them when a player is well known as a good or bad defender.”

To date, there is no definitive metric for capturing defensive value, so despite this warning, Box Plus/Minus is a worthy attempt to capture a player’s value. Its strengths as a metric include its availability throughout NBA history and its generally lower year-to-year variance for players compared with other “advanced” metrics.

Wingspan data is much less complete, but is fairly well-catalogued in the 2000s and 2010s on DraftExpress’ pre-draft measurement database. I took each player’s highest measured wingspan because many of these measurements occur while a player is still growing. The highest measurement obviously isn’t guaranteed to be the most recent or accurate one, but it should be enough to explore the theory that having a longer wingspan should lead to better defense. This will be the central focus of the investigation.
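The “highest measured wingspan” rule described above can be sketched as follows. This is a minimal Python illustration, not the code used for this project (the analysis itself was run in R); the player names and measurements are made up, and `highest_wingspan` is a hypothetical helper.

```python
# Hypothetical (player, wingspan in inches) measurements; values are made up.
measurements = [
    ("Player A", 82.0),
    ("Player A", 83.5),
    ("Player B", 79.25),
    ("Player B", 79.0),
    ("Player C", 88.0),
]


def highest_wingspan(rows):
    """Return a dict mapping each player to their largest recorded wingspan."""
    best = {}
    for player, wingspan in rows:
        if player not in best or wingspan > best[player]:
            best[player] = wingspan
    return best


wingspans = highest_wingspan(measurements)
print(wingspans)
```

Taking the maximum is simple and insensitive to measurement order, but it also keeps any single over-generous measurement, which is the trade-off noted above.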

Univariate Plots Section

##                Name          Season              Lg      
##  Joey Dorsey     :  21   Min.   :2003   NBA       :3636  
##  Mardy Collins   :  21   1st Qu.:2009   NCAA      :2007  
##  DeMarcus Nelson :  20   Median :2012   Euroleague: 294  
##  Jordan Farmar   :  20   Mean   :2012   Eurocup   : 240  
##  Paul Davis      :  20   3rd Qu.:2014   BSL       : 206  
##  Sergio Rodriguez:  20   Max.   :2016   CBA       : 161  
##  (Other)         :8063                  (Other)   :1641  
##       BLK                 BPM               DBPM              DRB         
##  Min.   : 0.000000   Min.   :-53.600   Min.   :-23.100   Min.   : 0.0000  
##  1st Qu.: 0.009219   1st Qu.: -3.500   1st Qu.: -1.500   1st Qu.: 0.1177  
##  Median : 0.045576   Median : -1.100   Median : -0.300   Median : 0.2094  
##  Mean   : 0.487775   Mean   : -0.993   Mean   : -0.085   Mean   : 2.9367  
##  3rd Qu.: 0.600000   3rd Qu.:  1.400   3rd Qu.:  1.300   3rd Qu.: 5.5000  
##  Max.   :16.800000   Max.   : 28.600   Max.   : 24.800   Max.   :35.4000  
##  NA's   :2           NA's   :4036      NA's   :4036      NA's   :2        
##        G               Ht              MP              OBPM        
##  Min.   : 1.00   Min.   :68.00   Min.   :   0.0   Min.   :-46.400  
##  1st Qu.:34.00   1st Qu.:76.00   1st Qu.: 303.8   1st Qu.: -2.800  
##  Median :60.00   Median :79.00   Median : 776.0   Median : -0.900  
##  Mean   :53.01   Mean   :78.84   Mean   : 889.1   Mean   : -0.909  
##  3rd Qu.:75.00   3rd Qu.:81.00   3rd Qu.:1193.0   3rd Qu.:  1.100  
##  Max.   :83.00   Max.   :90.00   Max.   :3388.0   Max.   : 47.800  
##  NA's   :4549                                     NA's   :4036     
##       ORB                 PF                STL              Wingspan    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.00000   Min.   :71.00  
##  1st Qu.: 0.04013   1st Qu.: 0.08258   1st Qu.: 0.03079   1st Qu.:79.50  
##  Median : 0.10303   Median : 0.13761   Median : 0.05579   Median :82.50  
##  Mean   : 1.18226   Mean   : 2.33716   Mean   : 0.73026   Mean   :82.35  
##  3rd Qu.: 1.60000   3rd Qu.: 4.40000   3rd Qu.: 1.40000   3rd Qu.:85.50  
##  Max.   :52.60000   Max.   :50.40000   Max.   :11.90000   Max.   :92.75  
##  NA's   :2          NA's   :2          NA's   :2          NA's   :2308   
##  Position      YFD         
##  B:2542   Min.   :-10.000  
##  G:2055   1st Qu.:  0.000  
##  S:1384   Median :  3.000  
##  W:2204   Mean   :  3.061  
##           3rd Qu.:  6.000  
##           Max.   : 13.000  
## 
## [1] 8185   17

We have 17 variables containing 8185 observations (player-seasons) in the raw NBA data. The dataset contains players who entered the NBA in 2003 or later.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   71.00   79.50   82.75   82.49   85.50   92.75     954

The median wingspan is 82.75 inches, while the maximum is 92.75 inches.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   68.00   76.00   79.00   78.97   82.00   90.00

The median height is 79.00 inches.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.0000 -1.4000 -0.3000 -0.1698  1.0000  7.5000

Most DBPMs fall between -3 and 2. There’s only one outlier after accounting for minutes played, but it’s a legitimate data point.

Most OBPMs fall below 0, and very few reach “elite” territory (+4 or more).

The player fouls, defensive rebounds, and blocks per 100 possessions were right-skewed so we will use a log transform. It makes sense that most players would get few blocks, rebounds and fouls but a small percentage would get many.

After log transformation, the three variables in question show roughly normal distributions.
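To illustrate why a log transform helps with right-skewed rate stats, the sketch below compares the sample skewness of a small, made-up set of block rates before and after taking logs. It is written in Python for illustration only; the values and the `skewness` helper are assumptions, not part of the original analysis.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, strictly positive block rates with a long right tail.
blocks = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 3.0, 6.0]
log_blocks = [math.log(v) for v in blocks]

raw_skew = skewness(blocks)
log_skew = skewness(log_blocks)
print(raw_skew, log_skew)
```

Note that `math.log(0)` raises a `ValueError`, so a log transform only applies to strictly positive values. Player-seasons with zero blocks would need to be dropped or handled with a different transform, such as a square root.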

Most players have a slightly longer wingspan than their listed height, but I was surprised at the amount of variance. This might be worth looking at in more detail.

Univariate Analysis

What is the structure of your dataset?

We have 17 variables containing 8185 observations (player-seasons) in the raw NBA data. The dataset contains players who entered the NBA in 2003 or later.

Other observations:

* DBPM mean is about 0 when controlling for minutes played.
* Most players get few steals and blocks, but a small percentage get many more than others.
* The average NBA player’s height is about 79 inches. Wingspans are generally longer, and the average is about 82.5 inches.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are Wingspan and Height. Can we use defensive statistics along with these variables to predict a player’s defensive value, measured by DBPM?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Defensive statistics like fouls, steals, and blocks per 100 possessions should help to create a model to predict a player’s defensive value.

Did you create any new variables from existing variables in the dataset?

Yes, I created Wingspan minus Height and Wingspan divided by Height. Perhaps the ratio or difference between Wingspan and Height will also play a role.
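The two derived variables are simple arithmetic on height and wingspan. The Python sketch below uses the names `WingspanminHt` and `WingspandivHt` as they appear in the model output later in this report; the `wingspan_features` helper and the example measurements are hypothetical.

```python
def wingspan_features(height_in, wingspan_in):
    """Return the wingspan-height difference and ratio (inputs in inches)."""
    return {
        "WingspanminHt": wingspan_in - height_in,
        "WingspandivHt": wingspan_in / height_in,
    }


# Example: a 79-inch player with an 82.5-inch wingspan (illustrative values).
features = wingspan_features(79.0, 82.5)
print(features)
```

The difference is in inches and treats an extra inch the same at every height, while the ratio scales with the player’s size, so the two can rank players slightly differently.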

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most distributions were normal, but defensive rebounds, blocks, and fouls per 100 possessions were clearly right-skewed. They display more normal distributions when log-transformed.

Bivariate Plots Section

## [1] 0.5266025

## [1] -0.2159396

The relationship between wingspan and DBPM appears linear, particularly for the bulk of the points. As expected, the relationship between wingspan and offense (measured via OBPM) is far weaker. The correlation between DBPM and wingspan is 0.53.

## [1] 0.4701667

The correlation between DBPM and height is 0.47.

Again, the relationship between height and DBPM appears linear. It’s difficult to say whether this is true for “actual defensive value,” but it’s clear DBPM favors taller and longer players.
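The correlations quoted above are Pearson correlation coefficients. As a rough sketch of what that number measures, here is a plain-Python version of the calculation. The `pearson_r` helper and the wingspan and DBPM values are made up for illustration and are not drawn from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up wingspans (inches) and DBPM values for six player-seasons.
wingspan = [78, 80, 82, 84, 86, 88]
dbpm = [-1.0, -0.5, 0.2, -0.1, 0.9, 1.4]

r = pearson_r(wingspan, dbpm)
print(r)
```

A value near 1 means the points sit close to an upward-sloping line, while a value near 0 means there is little linear relationship. A strong correlation does not by itself show that wingspan causes better defense.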

Let’s organize by position. Perhaps while it’s better to have a larger wingspan, it’s especially good to have a large wingspan for one’s position. I had a friend go through the dataset to label positions according to modern standards, which are generally more fluid than the traditional PG/SG/SF/PF/C. Guards generally defend PG/SG, Wings SG/SF, Swings SF/PF, and Bigs PF/C. If my theory is true, the trendlines should have roughly positive exponential curves.

Unfortunately for my theory, the pattern of these trendlines is more linear than exponential.
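One simple way to put a number on a per-position trend is to fit a straight line of DBPM on wingspan within each position group. The Python sketch below does this with made-up rows; it is a simplification, it does not reproduce the smoother used for the plotted trendlines, and the `slope` helper is hypothetical.

```python
from collections import defaultdict


def slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up (position, wingspan in inches, DBPM) rows; G = guard, B = big.
rows = [
    ("G", 76.0, -1.0), ("G", 78.0, -0.6), ("G", 80.0, -0.1),
    ("B", 86.0, 0.4), ("B", 88.0, 1.1), ("B", 90.0, 1.5),
]

# Collect wingspans and DBPMs separately for each position.
groups = defaultdict(lambda: ([], []))
for position, wingspan, dbpm in rows:
    groups[position][0].append(wingspan)
    groups[position][1].append(dbpm)

slopes = {pos: slope(xs, ys) for pos, (xs, ys) in groups.items()}
print(slopes)
```

If a long wingspan were especially valuable relative to one’s position, a straight line would fit poorly at the top of each group. Similar slopes with small residuals are consistent with the roughly linear pattern described above.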

The shapes of the density curves for all positions are all broadly similar, with larger players.

Let’s look at height versus wingspan. Perhaps it is good for a player’s defense if they have “outsized” wingspans, or wingspans much larger than their heights would suggest. That way the player could have the wingspan of a larger player with the quickness of a smaller one.

## [1] 0.2532392

The correlation of wingspan - height with DBPM is 0.25.

## [1] 0.2200685

The correlation of wingspan / height with DBPM is 0.22.

Let’s look at the most-outsized wingspans (wingspan exceeding height by more than 7 inches) in more detail.

At first glance, this could be a random sample of NBA players, but some of the NBA’s best defenders are in this group: the two-time reigning Defensive Player of the Year, Kawhi Leonard (a swing), as well as very adept rim-protecting bigs in Rudy Gobert, Anthony Davis, Bismack Biyombo, Larry Sanders, and David West. Is there a pattern here?
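Selecting the “outsized” group amounts to filtering on the wingspan-height difference. A minimal Python sketch follows, assuming a cutoff of more than 7 inches; the player records are made up and `OUTSIZED_CUTOFF` is a hypothetical name.

```python
OUTSIZED_CUTOFF = 7.0  # inches; keep wingspan - height strictly above this

# Made-up player records (heights and wingspans in inches).
players = [
    {"name": "Player A", "Ht": 79.0, "Wingspan": 87.0},
    {"name": "Player B", "Ht": 81.0, "Wingspan": 84.0},
    {"name": "Player C", "Ht": 83.0, "Wingspan": 90.5},
    {"name": "Player D", "Ht": 75.0, "Wingspan": 82.0},
]

outsized = [
    p for p in players
    if p["Wingspan"] - p["Ht"] > OUTSIZED_CUTOFF
]
print([p["name"] for p in outsized])
```

A fixed cutoff in inches ignores a player’s overall size. Filtering on the wingspan to height ratio instead would select a slightly different group.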

## [1] 0.6274634

## [1] 0.204175

## [1] 0.204175

## [1] 0.6796349

The relationships between the variables (keeping the log transformations of fouls and defensive rebounds while taking the square root of blocks) with DBPM provide results that are roughly linear.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Height predictably varies linearly with wingspan. DBPM varies linearly with height and wingspan.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The difference between height and wingspan also varied linearly, albeit slightly, with DBPM. This is now a main feature of interest.

What was the strongest relationship you found?

Blocks, defensive rebounding, and wingspan most strongly correlated with DBPM.

Multivariate Plots Section

Again, we see the relationship between wingspan and DBPM isn’t very different among positions. There is more variance in the smallest and largest positions, as well as more data.

Jitter was added to this plot in order to reduce overplotting and make it easier to navigate through names. This graph indicates that the wingspan to height ratio might be useful to include in a linear model. Let’s test a few, beginning with the ordinary rate stats and seeing how adding the wingspan and height factors affects r-squared.

## 
## Calls:
## Model 1: lm(formula = DBPM ~ STL + sqrtBLK + logDRB + logPF, data = NBA)
## Model 2: lm(formula = DBPM ~ Ht + STL + sqrtBLK + logDRB + logPF, data = NBA)
## Model 3: lm(formula = DBPM ~ Wingspan + STL + sqrtBLK + logDRB + logPF, 
##     data = NBA)
## Model 4: lm(formula = DBPM ~ WingspanminHt + STL + sqrtBLK + logDRB + 
##     logPF, data = NBA)
## Model 5: lm(formula = DBPM ~ WingspandivHt + STL + sqrtBLK + logDRB + 
##     logPF, data = NBA)
## Model 6: lm(formula = DBPM ~ WingspandivHt + Wingspan + STL + sqrtBLK + 
##     logDRB + logPF, data = NBA)
## 
## ===================================================================================
##                   Model 1    Model 2    Model 3    Model 4    Model 5    Model 6   
## -----------------------------------------------------------------------------------
##   (Intercept)    -6.230***  -5.155***  -5.384***  -6.291***  -5.903***  -5.678***  
##                  (0.138)    (0.672)    (0.625)    (0.156)    (0.817)    (0.830)    
##   STL             1.135***   1.116***   1.129***   1.142***   1.142***   1.123***  
##                  (0.031)    (0.033)    (0.036)    (0.035)    (0.035)    (0.037)    
##   sqrtBLK         2.105***   2.143***   2.083***   2.041***   2.039***   2.087***  
##                  (0.059)    (0.064)    (0.074)    (0.068)    (0.068)    (0.075)    
##   logDRB          1.624***   1.689***   1.745***   1.691***   1.690***   1.761***  
##                  (0.067)    (0.078)    (0.082)    (0.074)    (0.074)    (0.087)    
##   logPF          -0.441***  -0.429***  -0.443***  -0.448***  -0.448***  -0.438***  
##                  (0.064)    (0.065)    (0.072)    (0.072)    (0.072)    (0.073)    
##   Ht                        -0.015                                                 
##                             (0.009)                                                
##   Wingspan                             -0.013                           -0.016     
##                                        (0.008)                          (0.011)    
##   WingspanminHt                                   -0.006                           
##                                                   (0.010)                          
##   WingspandivHt                                              -0.389      0.528     
##                                                              (0.777)    (0.980)    
## -----------------------------------------------------------------------------------
##   sigma             0.963      0.963      0.924      0.924      0.924      0.924   
##   R-squared         0.675      0.675      0.697      0.697      0.697      0.697   
##   F              1413.186   1131.775    941.686    940.408    940.324    784.514   
##   p                 0.000      0.000      0.000      0.000      0.000      0.000   
##   N              2728       2728       2049       2049       2049       2049       
## ===================================================================================

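Whereas the addition of height doesn’t add anything to the ordinary rate stats when it comes to predicting DBPM, the three wingspan stats bump the r-squared value up to 0.697. Note, however, that Models 3 through 6 are fit on the 2,049 player-seasons with a recorded wingspan while Models 1 and 2 use 2,728, so part of this bump may come from the change in sample, and none of the wingspan coefficients is individually significant. My basketball intuition led me to construct Model 6, in which I included both wingspan and the wingspan to height ratio, but this produced no change in r-squared.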
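The N row of the table shows that the models with a wingspan term use fewer rows than the two without one. One way to make the r-squared comparison like-for-like is to refit the baseline on only the rows that have a wingspan. The Python sketch below illustrates the idea with a small made-up dataset and a single rate stat; the `solve` and `r_squared` helpers are hypothetical and stand in for R’s `lm`.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(rows, predictors, target):
    """Fit ordinary least squares with an intercept and return R-squared."""
    X = [[1.0] + [row[p] for p in predictors] for row in rows]
    y = [row[target] for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * xi for b, xi in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up player-seasons; None marks a missing wingspan measurement.
data = [
    {"STL": 1.0, "Wingspan": 80.0, "DBPM": -0.8},
    {"STL": 1.5, "Wingspan": 84.0, "DBPM": 0.1},
    {"STL": 2.0, "Wingspan": None, "DBPM": 0.6},
    {"STL": 0.8, "Wingspan": 86.0, "DBPM": 0.3},
    {"STL": 2.4, "Wingspan": 82.0, "DBPM": 1.0},
    {"STL": 1.2, "Wingspan": None, "DBPM": -0.5},
    {"STL": 1.9, "Wingspan": 88.0, "DBPM": 1.5},
    {"STL": 0.6, "Wingspan": 79.0, "DBPM": -1.2},
]

# Keep only rows where every variable used by either model is present.
complete = [row for row in data if row["Wingspan"] is not None]

r2_base = r_squared(complete, ["STL"], "DBPM")
r2_wingspan = r_squared(complete, ["STL", "Wingspan"], "DBPM")
print(r2_base, r2_wingspan)
```

On a common sample, adding a predictor can never lower r-squared, so the size of the increase (or an adjusted r-squared) is what indicates whether wingspan adds information beyond the rate stats.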

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There was a lot more variance in wingspans for guards and bigs than for swings and wings. Generally speaking, DBPM tended to rise as a player’s wingspan to height ratio increased, although this is not a rule: there are strong defensive seasons from players with ho-hum wingspans for their size, like Joakim Noah and Andrew Bogut.

Were there any interesting or surprising interactions between features?

I suspected having a long wingspan for one’s position would be especially beneficial for DBPM, but my investigation showed that this wasn’t the case. It was interesting to find that players with more outsized wingspans generally had higher DBPMs.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried out several multiple linear regression models that combine defensive stats with height and wingspan data to predict DBPM. Whereas adding height to the ordinary rate stats produced virtually no change, including the various wingspan stats produced a higher r-squared. However, there was no difference in r-squared among the models using the different wingspan variables. Perhaps a different approach to collecting wingspan data than the one outlined in the introduction would produce a higher r-squared.


Final Plots and Summary

Plot One

Description One

NBA wingspans follow a roughly normal distribution, with most wingspans falling between 76 and 88 inches, and a peak at around 83 inches.

Plot Two

Description Two

DBPM varies linearly with height in total and across all positions. DBPM and wingspan increase as one moves up a position, and there is a larger variety of both DBPM and wingspans among guards and bigs than wings and swings.

Plot Three

Description Three

Jitter was added to this plot in order to reduce overplotting and make it easier to navigate through names. Here we see that there are more greens and yellows above the trendline and more reds and oranges below. However, there is a less noticeable shift in colors while moving solely left to right. This indicates that the relationship between wingspan and height plays a role in DBPM.


Reflection

The basketball dataset I compiled contains 17 variables with 8185 observations. I began my investigation by looking at univariate data to get a feel for the distribution of defensive metrics, heights, wingspans, and Defensive Box Plus/Minus. I explored the relationship between these variables using summary statistics and plots and, based on my observations, created new variables like the wingspan-height difference and ratio, and transformed some key metrics. Eventually, my investigation led me to create a series of linear models which I could compare to determine if wingspan could be used to predict a player’s defensive contribution.

While defensive box-score metrics can get one a long way toward approximating a player’s defensive ability, adding a wingspan variable (wingspan, wingspan minus height, or the wingspan to height ratio) produces a small but noticeable jump in explaining variance in DBPM. This indicates that wingspan is an important factor in estimating a player’s defensive ability and potential while not being crucial. A player with a mediocre wingspan for their height can still be an elite defender (as is the case with Joakim Noah and Andrew Bogut), but players with short wingspans for their height are rarely impactful. Having an “outsized” wingspan is undoubtedly helpful on defense, likely because one can move like a quicker player while having the length of a larger one. While we see that many of these players are among the NBA’s best, there are still poor defenders with outsized wingspans. Attributes like lateral quickness, strength, tenacity, and focus are still central to any defensive projection.

One limitation of this investigation is the reliability of DBPM as a measure of defensive contribution. Defensive impact is notoriously hard to measure through box scores, and it would have been helpful to compare the central variables to alternative metrics like Real Plus-Minus and RAPM, but Real Plus-Minus only has three years’ worth of data and I only found RAPM data which aggregates all of a player’s seasons. Second, there is the reliability of wingspan data, for which uniformity is an impossible goal given that players are often still growing around the time when measurements are conducted. In addition, different organizations conduct the measurements. After some deliberation, I took the highest recorded measurement. Finally, this data is truncated because A) player data begins with the 2003 draft and therefore is skewed toward younger players, and B) players who are poor defenders may be excluded from the data since they did not play enough or are out of the league. Nevertheless, I was pleasantly surprised by the high r-squared value given how difficult it is to approximate defensive impact.

All in all, wingspan is an important part of defense, and given that it is helpful in measuring defensive contribution in a given season, it should be used in prospect projections as well, albeit not at the expense of other defensive attributes. In addition, exploring how age affects DBPM might give better insight into the importance of wingspan over the course of an NBA career.

Acknowledgements

I made extensive use of the RStudio Cheatsheets. Data was scraped from RealGM, Basketball Reference, and DraftExpress.